(57) ABSTRACT

The present invention provides a method and apparatus for 
generating a database search result. The creation of the 
search result is achieved by representing the subdocument 
lists of an inverted database with encoded bit strings. The 
encoded bit strings are space efficient methods of storing the 
correspondence between terms in the database and their 
occurrence in subdocuments. Logical combinations of these 
bit strings are then obtained by identifying the intersection, 
union, and/or inversion of a plurality of the bit strings. Since 
keywords for a database search can be identified by selecting 
the terms of the inverted database, the logical combinations 
of bit strings represent search results over the database. This 
technique for generating a search result is
computationally efficient because computers combine bit 
strings very efficiently. Also, the search elements of the 
present invention are not just limited to keywords. The 
search elements also include types of fields (e.g., date or 
integer fields) or other extracted entities. 

11 Claims, 4 Drawing Sheets 
METHOD AND APPARATUS USING RUN LENGTH ENCODING TO EVALUATE A
DATABASE

This is a divisional of U.S. patent application Ser. No.
09/203,408 filed Dec. 2, 1998, now U.S. Pat. No. 6,112,204,
which is a division of U.S. patent application Ser. No.
08/900,562 filed Jul. 25, 1997, which issued on Apr. 6, 1999
as U.S. Pat. No. 5,893,094.

FIELD OF THE INVENTION

This invention relates to the field of computerized
information search and retrieval systems and, more particularly,
to a method and apparatus for comparing database search
results.

BACKGROUND OF THE INVENTION

Information is increasingly being represented as digital
bits of data and stored within electronic databases. These
databases often include extremely large numbers of records
containing data fields reflecting an endless variety of
objects. Some databases, for example, contain the full text of
judicial opinions issued by every court in the United States
for the past one hundred and fifty years. Other databases
may be filled with data fields containing particularized
information about vast numbers of individuals (e.g., names,
addresses, telephone numbers, etc.). As more information is
stored in these databases, the larger these data compilations
become.

Among the many advantages associated with electronic
storage is the fact that any given database can be searched
for the purpose of retrieving individual data records (e.g.,
documents) that may be of particular interest to the user. One
of the ways to perform this search is to simply determine
which data records, if any, contain a certain keyword. This
determination is accomplished by comparing the keyword
with each record in the database and assessing whether the
keyword is present or absent. In addition, database users can
search for data records that contain a variety of keyword
combinations (e.g., "cats" and "dogs", etc.). This operation,
known as a Boolean search, uses the conjunctions "AND",
"OR", and "NOT" (among others) to join keywords in an
effort to more precisely define and/or simplify the database
search. For example, if a user joins the keywords "cats" and
"dogs" with the conjunction "AND" and inputs the query
"cats AND dogs", only those records that contain both the
term "cats" and the term "dogs" will be retrieved.

The problem with this Boolean search, however, is that a
computer typically makes use of substantial memory space
and computing time to perform logical combinations of sets
of documents corresponding to the keyword search results.
It is therefore desirable to create a system that performs
logical combinations on set elements and that is efficient in
both space and computation time.

OBJECTS OF THE INVENTION

It is an object of the present invention to analyze data
records in a database.

It is a further object of the present invention to analyze
data records in a database by efficiently representing the
results of element tests against the database.

It is another object of the present invention to analyze data
records in a database by efficiently combining the results of
element tests against the database.

It is still a further object of the present invention to
analyze data records in a database by efficiently representing
the results of keyword tests against the database.

It is still a further object of the present invention to
analyze data records in a database by efficiently combining
the results of keyword tests against the database.

It is still a further object of the present invention to
analyze data records in a database by efficiently representing
the results of field type tests against the database.

It is still a further object of the present invention to
analyze data records in a database by efficiently combining
the results of field type tests against the database.

SUMMARY OF THE INVENTION

The present invention is a method and apparatus for
generating a database analysis result by representing the
subdocument lists of an inverted database with encoded bit
strings. The encoded bit strings are space efficient methods
of storing the correspondence between terms in the database
and their occurrence in subdocuments. Logical combinations
of these bit strings are then obtained by identifying the
intersection, union, and/or inversion of a plurality of the bit
strings. Since keywords for a database search can be
identified by selecting the terms of the inverted database, the
logical combinations of bit strings represent search results
over the database. This technique for generating a search
result is computationally efficient because computers
combine bit strings very efficiently. The search elements of
the present invention are not just limited to keywords. The
search elements also include types of fields (e.g., date or
integer fields) or other extracted entities.

These and other aspects and advantages of the present
invention will become better understood with reference to
the following description, drawings, and appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be described in detail with
reference to the following drawings, in which like reference
numerals refer to like elements, and wherein:

FIG. 1 is an illustration of a computer system for
searching a database according to the present invention.

FIG. 2 is a flowchart that illustrates a process for inverting
a database.

FIG. 3 is a flowchart that illustrates a process for
searching a database according to the present invention.

FIG. 4 is an illustration of combining bit strings.

FIG. 5 is a flowchart that illustrates a process for the union
combination of bit strings according to the present invention.

FIG. 6 is a flowchart that illustrates a process for the
intersection combination of bit strings according to the
present invention.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 illustrates a computer system for searching
databases. The computer 20 consists of a central processing unit
(CPU) 30 and main memory 40. The computer 20 is coupled
to an input/output (I/O) system 10 that includes a display 5,
a keyboard 7 and mouse 9. The computer 20 interacts with
a disk storage unit 50 and the I/O system 10 to search
databases that are stored on the disk storage unit 50. The
results of those searches are displayed to the user, or
alternatively, used by computer 20 for further processing of
the information in the database.
According to the present invention, the database that is
stored in disk storage unit 50 is inverted. In general, an
inverted database is a listing of all the terms of the database
and the regions of text associated with those terms. FIG. 2
illustrates a process for operating a computer system to
invert a database. In step 132, the computer 20 selects a
document from the database in disk storage unit 50. In step
134, the document is divided into subdocuments. In this
process, for example, the computer 20 detects paragraph
boundaries in the document and creates subdocuments that
generally correspond to the paragraphs in the document.
Long paragraphs may consist of multiple subdocuments and
several short paragraphs may be included in a single
subdocument. The subdocuments all have approximately the
same length. Furthermore, each subdocument is assigned a
numerical identifier that identifies its location in the
database.

In steps 136 and 138 of FIG. 2 respectively, a
subdocument is then selected and parsed by the computer 20. Parsing
a subdocument generally involves listing the terms in the
subdocument. In this embodiment of the present invention,
the parsing process is accomplished by assigning linguistic
structure to sequences of words in a sentence and listing
those terms or noun phrases of the subdocument that have
semantic meaning. The parsing process can be implemented
by a variety of techniques known in the art such as the use
of lexicons, morphological analyzers or language grammar
structures.

Once a subdocument has been parsed, step 140 generates
a term list associating subdocument terms (including noun
phrases) and the corresponding subdocument identifiers in
which the terms occur. All the subdocuments for each
document of the database are processed in this way and the
list of terms and subdocuments is updated. Finally, all the
documents of a database are processed according to steps
132-140. The result of this inversion process is a term list
identifying all the terms (including noun phrases in this
example) of a database and the identity of the subdocuments
in which the terms occur.

In this embodiment of the present invention, each list of
subdocuments associated with a term in the inverted
database is represented and stored by a technique known as run
length encoding. This approach recognizes that binary bit
strings typically consist of repeated sets of bits of the same
value (i.e., "1's" and "0's"), which can be encoded for later
application. Using this technique, long binary bit strings that
span millions of characters can be efficiently compressed
into notably smaller bit strings.

In particular, the list of subdocuments of a database in
which a term appears is represented by a series or bit string
of 1's and 0's. Each subdocument is represented by a bit
position in this bit string. When a '1' occurs in this bit string,
its position indicates the particular subdocument in the
database in which a term occurs. When a '0' occurs in this
bit string, its position indicates that the term did not occur in
that particular subdocument. A sample representation of
subdocuments associated with a document in which a
particular term appears might be
"1111111111000000000000000000001111." In this bit
string, the particular term appears in the first 10
subdocuments, it does not appear in the next 20 subdocuments
and it appears in the next 4 subdocuments. A series of
bit strings, wherein each bit represents a subdocument in the
database, are then concatenated to represent the appearance
of the particular term across the database.

Once the bit string for the entire database has been
generated, this bit string is then compressed into a single
code. For example, this code for the subdocument described
above might be "{X1, X2, X3}", wherein X1 represents the
sequence "1111111111", X2 represents the sequence
"00000000000000000000", and X3 represents the sequence
"1111". In this case, the variables used to compute each
compressed code (i.e., X1, X2, X3, etc.), are derived by
denoting the number of "1's" followed by the number of
"0's" in each run. According to this notation, the code {25,
3, 128, 14} could represent a sequence of twenty-five "1's",
followed by three "0's", followed by one hundred and
twenty-eight "1's", followed by fourteen "0's", and so on.

Alternatively, each run of "1's" and "0's" in a given bit
string could be encoded with a first indicator that identifies
the polarity of the run as either a "1" or a "0" and a second
indicator that identifies the total number of bits contained
within the run. In this regard, each variable (i.e., X1, X2, X3,
etc.) would be a two-number designation in which the first
number would be the binary value and the second number
would be the length of the run for each of those values, such
as {1,25; 0,3; 1,128; 0,14}.

The inverted database in which the subdocument list
associated with each term is represented by a run length
encoding is stored in disk storage unit 50 and is operated on
by the computer 20 to perform a search. FIG. 3 is a flowchart
that illustrates the search process. Initially in step 10, the
computer 20 selects the inverted database (from among
several that may be stored on disk storage unit 50) to be
searched. The selection is normally made by a user input to
the computer 20. Alternatively the selection could be made
by the computer 20 based on predefined selected criteria.
Once the database has been selected in step 10, a query is
created in step 20 and sent to computer 20. This query is
created in a variety of conventional ways such as by a user
typing the query on the keyboard or by highlighting text
from a document. The computer 20 then parses the query into
a series of keywords joined by Boolean logic operators.

Once the query is parsed, the computer 20 performs step
30 in which the compressed bit strings for each term in the
query are retrieved. In this step the computer 20 also reduces
the logical combination of query keywords into a
combination of union, intersection and inversion operations for the
compressed bit strings. For example, if the query called for
the Exclusive OR of the terms A and B (i.e., retrieve the
documents having A or B but not those documents having A
and B), then the set operators that are combined to create this
search result are: (A intersect (inversion of B)) union (B
intersect (inversion of A)). The set operators union,
intersection and inversion can be combined to create any
Boolean logic operation. As a result, any search request can be
executed by combining these set operations on the encoded
bit strings representing the occurrence of terms in the
database.

FIG. 4 illustrates the combination of compressed bit
strings for union and intersection. The individual bit strings
for Query Term A 32 and Query Term B 34 are illustrated by
a solid line representing '1's and a blank representing '0's.
The shaded area in the intersection 36 and union 38 of A and
B represents a '1'. Although not shown in FIG. 3, the
inversion operator is simply accomplished by changing the
polarity of each bit in the string.

FIG. 5 illustrates a process for evaluating the union of sets
represented by run length encoded (RLE) bit strings.
Initially in Step 42 the overlapping range from a first and
second RLE is determined. In addition to the range of step
42, step 44 adds ranges from the minimum of the first or
second overlapping RLE and adds range from the maximum
of the first or second overlapping RLE. Finally in Step 46
range is added when either RLE has a non-overlapping range
in the other RLE.
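The run length notation and the union of FIG. 5 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names (rle_encode, rle_decode, one_ranges, union_rle) are hypothetical, and the sketch assumes the first notation described above (alternating counts that start with a run of 1's, with a leading zero-length run when the string begins with 0). The union is read here as merging overlapping ranges from the minimum start to the maximum end and carrying non-overlapping ranges through unchanged, which is one plausible interpretation of Steps 42-46.

```python
from itertools import groupby


def rle_encode(bits):
    """Encode a '0'/'1' string as alternating run lengths, 1's first."""
    counts = [len(list(group)) for _, group in groupby(bits)]
    # The notation starts with a count of 1's, so a string that begins
    # with '0' gets an explicit zero-length run of 1's in front.
    if bits and bits[0] == "0":
        counts.insert(0, 0)
    return counts


def rle_decode(counts):
    """Expand alternating run lengths back into a '0'/'1' string."""
    return "".join(("1" if i % 2 == 0 else "0") * n
                   for i, n in enumerate(counts))


def one_ranges(counts):
    """Return the half-open [start, end) ranges covered by 1's."""
    ranges, pos = [], 0
    for i, n in enumerate(counts):
        if i % 2 == 0 and n > 0:
            ranges.append((pos, pos + n))
        pos += n
    return ranges


def union_rle(a, b):
    """Union of two encodings that cover the same number of bits."""
    total = sum(a)
    merged = []
    for start, end in sorted(one_ranges(a) + one_ranges(b)):
        if merged and start <= merged[-1][1]:
            # Overlapping (or touching) runs: keep the minimum start
            # and extend to the maximum end.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Non-overlapping run: carried through as is.
            merged.append((start, end))
    # Convert the merged ranges back to alternating counts.
    counts, pos = [], 0
    for start, end in merged:
        if pos == 0 and start > 0:
            counts.append(0)
        if start > pos:
            counts.append(start - pos)
        counts.append(end - start)
        pos = end
    if pos < total:
        if not counts:
            counts.append(0)
        counts.append(total - pos)
    return counts


term_a = rle_encode("1" * 10 + "0" * 20 + "1" * 4)   # sample string from the text
term_b = rle_encode("0" * 8 + "1" * 6 + "0" * 20)    # made-up second term
a_or_b = union_rle(term_a, term_b)
```

The union never expands the bit strings; its cost depends on the number of runs rather than on the number of subdocuments.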

FIG. 6 illustrates the process for evaluating the
intersection of RLEs. In Step 52, overlapping RLEs are determined.
In Step 54, range is generated from the maximum start of the
first or second RLE until the minimum end of the first or
second RLE. The combinations of the RLE bit strings shown
in FIGS. 3-5 can of course be performed on any number (2
or greater) of RLE bit strings. This is significant because a
database can be preprocessed to determine bit strings for
many elements. When search results are required for any
combination of the preprocessed elements, the RLE bit
strings can be combined and the search result for the
combination of elements is quickly generated.
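The intersection of Steps 52-54 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented implementation: each RLE is represented only by its runs of 1's, as sorted half-open (start, end) ranges of subdocument positions, and the names intersect and intersect_all and the sample term data are hypothetical.

```python
from functools import reduce


def intersect(a, b):
    """Intersect two sorted lists of half-open (start, end) runs of 1's."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        # Step 54: the shared range runs from the maximum start
        # to the minimum end of the two runs being compared.
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:          # Step 52: the runs actually overlap
            out.append((start, end))
        # Advance whichever run finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(*run_lists):
    """Intersect any number (2 or greater) of run lists."""
    return reduce(intersect, run_lists)


cats = [(0, 10), (30, 34)]    # subdocuments 0-9 and 30-33 contain "cats"
dogs = [(8, 14), (32, 40)]
birds = [(5, 33)]
cats_and_dogs = intersect(cats, dogs)
all_three = intersect_all(cats, dogs, birds)
```

Folding the pairwise operation over the inputs is one simple way to extend it to more than two bit strings; each pass visits every run once.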

The process of operating the computer on the inverted and
encoded database as illustrated in FIGS. 2-6 is efficient in
generating search results over large databases. This is
because, generally, there are four major operators for
manipulating sets. They are: union, intersection, inversion
and testing for the existence of an element in the set. The use
of run length encoding allows the computer to perform the
operations of union, intersection and inversion efficiently.
The set operation of testing for an element over the database
does not need to be performed in responding to a query
because that step has effectively been done when the
database was inverted and encoded. As a result the process of the
present invention generates results for database queries
quickly and efficiently.
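A small Python sketch may make the last two points concrete: the element test happens once, while the database is inverted, and inversion on the encoded form only flips run polarities. The sketch uses the polarity-and-length notation described earlier. Whitespace tokenization stands in for the linguistic parsing of the embodiment, and the function names and sample subdocuments are hypothetical.

```python
from itertools import groupby


def invert_database(subdocuments):
    """Map each term to a bit string with one bit per subdocument."""
    # Assumption: splitting on whitespace replaces real parsing.
    parsed = [set(text.split()) for text in subdocuments]
    terms = set().union(*parsed)
    # The membership test for every term is done here, once.
    return {term: "".join("1" if term in words else "0" for words in parsed)
            for term in sorted(terms)}


def encode_runs(bits):
    """Encode a bit string as (polarity, length) pairs."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]


def invert_runs(runs):
    """Inversion: flip the polarity of each run, lengths unchanged."""
    return [("0" if bit == "1" else "1", length) for bit, length in runs]


subdocs = ["cats chase dogs", "dogs bark", "cats sleep", "cats purr"]
index = invert_database(subdocs)
cats = encode_runs(index["cats"])
not_cats = invert_runs(cats)
```

At query time only the stored runs for the query terms are read and combined; no subdocument text is scanned again.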

The process of the present invention is not only useful for
generating search results on Boolean combinations of
keywords but it is also useful to efficiently generate search
results on any Boolean combination of elements in a
database. In particular, these elements can be types of fields or
combinations of words. This is because the terms and their
associated bit strings can be categorized into types. For
example, all dates can be combined and
represented by a date field bit string. The search elements
could also involve other extracted entities such as names,
places, or relationships (such as a buyer in an acquisition).
Database records can also be evaluated for the presence or
absence of sentences, characters, non-text objects (e.g.,
icons, pictures, sound representations), other types of fields
or bit sequences of any sort. A combination of RLE bit
strings associated with these elements, and hence a search
result, is efficiently generated by this embodiment of the
present invention.
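As a hedged illustration of a field-type bit string, the Python sketch below folds the bit strings of two date terms into one date field string and then combines that string with a keyword. It is a generic run-by-run combiner rather than the specific procedures of FIGS. 5 and 6, and everything named here (combine, to_bits, the sample terms and their runs) is an assumption made for the example. Runs are (bit, length) pairs with integer bits, every run length is positive, and all inputs cover the same number of subdocuments.

```python
def combine(fn, *rles):
    """Apply a Boolean function across RLE strings, one run segment at a time."""
    idx = [0] * len(rles)                    # current run in each input
    left = [rle[0][1] for rle in rles]       # bits left in each current run
    out = []
    while True:
        bits = [rle[i][0] for rle, i in zip(rles, idx)]
        step = min(left)                     # longest stretch with no boundary
        bit = int(bool(fn(*bits)))
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)   # extend the previous run
        else:
            out.append((bit, step))
        for k, rle in enumerate(rles):
            left[k] -= step
            if left[k] == 0:
                idx[k] += 1
                if idx[k] == len(rle):
                    return out               # equal lengths: all inputs end here
                left[k] = rle[idx[k]][1]


def to_bits(rle):
    """Expand (bit, length) runs into a '0'/'1' string for inspection."""
    return "".join(str(bit) * length for bit, length in rle)


jan_1997 = [(1, 2), (0, 6)]                  # 11000000
dec_1998 = [(0, 5), (1, 2), (0, 1)]          # 00000110
acquisition = [(0, 1), (1, 5), (0, 2)]       # 01111100

date_field = combine(lambda a, b: a or b, jan_1997, dec_1998)
hits = combine(lambda d, k: d and k, date_field, acquisition)
either_not_both = combine(lambda d, k: d != k, date_field, acquisition)
```

The last call computes the Exclusive OR directly; it yields the same bits as (A intersect (inversion of B)) union (B intersect (inversion of A)).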

Although the present invention has been described and 
illustrated in detail with reference to certain preferred 
embodiments thereof, other versions are possible. Upon 
reading the above description, it will become apparent to 
persons skilled in the art how to make changes in form or
detail without departing from the substance of the invention. 
I claim:

1. A method of storing data, comprising: 

creating a plurality of subdocuments of approximately 
equal length from a database; 

representing the occurrence of a plurality of terms in each 
of said subdocuments by an encoded bit string; 

combining a plurality of said bit strings, wherein said 
combination represents a search result from said data- 
base. 

2. The method of claim 1, further comprising generating 
a comparison list indicating the relation between a first 
encoded bit string and a second encoded bit string. 

3. The method of claim 2, wherein said comparison list 
indicates the intersection between said first and said second 
encoded bit strings. 

4. The method of claim 2, wherein said comparison list 
indicates the union between said first and said second 
encoded bit strings. 

5. An apparatus for storing data, comprising:

a computer coupled to a disk storage unit, said disk 
storage unit stores a database, said computer creates a 
plurality of subdocuments of approximately equal 
length from said database; 

said computer represents the occurrence of a plurality of 
terms in each of said subdocuments by an encoded bit 
string; 

said computer combines a plurality of said encoded bit 
strings; and 

said computer stores said plurality of said encoded bit 
strings. 

6. The apparatus of claim 5, wherein said processor 
further generates a comparison list that indicates the relation
between a first and a second encoded bit string. 

7. The apparatus of claim 5, wherein said comparison list 
indicates the intersection between said first and said second 
encoded bit string. 

8. The apparatus of claim 5, wherein said comparison list 
indicates the union between said first and said second 
encoded bit string. 

9. A method of retrieving data from a database, compris- 
ing the steps of: 

creating a plurality of subdocuments from a database; 
representing the occurrence of at least one term in said 

subdocuments by an encoded bit string; 
identifying said subdocuments containing said bit string; 

and 

retrieving said subdocuments containing said bit string. 

10. The method of claim 9, wherein said encoded bit 
string represents the number of sequential subdocuments 
that contain the occurrence of said term. 

11. The method of claim 9 further comprising the step of 
storing said subdocuments containing said bit string. 